perf(fmt): speed up file diffing#30644
Conversation
This swaps out dissimilar for imara which is substantially faster at diffing strings. Note that this is a proof of concept and I did not have enough time to make the output pretty. I just shows that the diff is fast. Applying colors should be doable without changing much about the perf. Fixes denoland#30634
| dprint-plugin-typescript = "=0.95.11" | ||
| env_logger = "=0.11.6" | ||
| fancy-regex = "=0.14.0" | ||
| imara-diff = "=0.2.0" |
There was a problem hiding this comment.
I'm willing to fix this up if I get an OK about the general direction.
I think this sounds good. I kind of wonder if there's a diffing library that allows bailing after X many differences though as it would work well for incredibly large files. I wonder if we could contribute that to dissimilar and if they'd take a patch that does that (maybe it's not too difficult?).
There was a problem hiding this comment.
I opened dtolnay/dissimilar#21 -- it might be more worthwhile to pursue this path than rewrite to imara-diff, which still might not be fast enough with very large files.
There was a problem hiding this comment.
I'm willing to fix this up if I get an OK about the general direction.
I think this sounds good. I kind of wonder if there's a diffing library that allows bailing after X many differences though as it would work well for incredibly large files. I wonder if we could contribute that to dissimilar and if they'd take a patch that does that (maybe it's not too difficult?).
I think both dissimilar and imara expose an iterator over the patches, so I would assume that we can just stop iterating and thereby abort the computation of the diff early.
I have yet to check if my assumption is correct, though.
There was a problem hiding this comment.
Is there a reason beyond the cost of migration why you'd like to stay with dissimilar? From my superficial understanding, it looks like imara is simply a better (=faster) diffing lib in all respects that are relevant for deno.
There was a problem hiding this comment.
Is there a reason beyond the cost of migration why you'd like to stay with dissimilar?
We know the diff output of dissimilar is ok, but not sure yet about imara. Generally diffs are only shown in error cases so perf doesn't matter too much, but obviously several minutes is not acceptable 😅. How much faster is imara for this diff? I guess if it's fast enough on this case then maybe that's good enough and we don't need to worry about doing some iterator or max results approach.
There was a problem hiding this comment.
On my machine, deno 2.5.0 needs around 36 minutes (!) to check the formatting of the file and print out the diff. My branch (still in debug build, did not compile with -r yet) cuts it down to 0.4 seconds.
I did not try larger files using Deno 2.5.0 but I tried them with this branch. The results are as follows:
- 1 MB file: 0.4 seconds
- 10 MB file: 4 seconds
- 100 MB file: 40 seconds
(not evaluated this very scientifically, please take it with a grain of salt)
All files had a similar format as shown #30634.
There was a problem hiding this comment.
I have started working on bringing back properly formatted diffs. One thing I have noticed is that imara is extremely good at finding line diffs, but it does not have built-in word-diffing (see pascalkuthe/imara-diff#1). I will ask if they accept contributions, but otherwise I'm afraid we will have to add the complexity here. This is something that dissimilar provides out of the box, but they seem to do it by not even tokenizing the input at all, which explains why it is so slow. (Note that this also means that imara might get a lot slower once we run word diffs with it.)
There was a problem hiding this comment.
Upstream work continues: pascalkuthe/imara-diff#33
There was a problem hiding this comment.
If my upstream work gets merged, the debug build of imara will be able to diff this same string is less then 1 second including the full word diff of the file (down from 36 minutes on Deno's main branch built in release mode)
There was a problem hiding this comment.
Pull Request Overview
This PR replaces the dissimilar diff library with imara-diff for computing text diffs in the resolver display module. The change updates the diff algorithm implementation while maintaining the same output format.
Key Changes:
- Switched from
dissimilartoimara-difflibrary for diff computation - Updated
DiffBuilderimplementation to work with imara-diff's hunk-based API - Deferred implementation of context line handling (marked with TODO comments)
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| Cargo.toml | Added imara-diff = "=0.2.0" to workspace dependencies |
| Cargo.lock | Updated lock file with imara-diff dependency and its transitive dependencies (hashbrown, memchr) |
| libs/resolver/Cargo.toml | Replaced dissimilar with imara-diff in resolver crate dependencies |
| libs/resolver/display.rs | Rewrote diff computation logic to use imara-diff's InternedInput and hunk-based API; added lifetime parameter to DiffBuilder; commented out unused helper functions fmt_add_text and fmt_rem_text |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughThis PR migrates the diff implementation from the dissimilar crate to imara_diff. The changes include adding imara-diff as a new dependency in the CLI crate's Cargo.toml, replacing dissimilar with imara-diff as a workspace dependency in libs/resolver, and refactoring display.rs to use imara_diff's API. The refactoring swaps out chunk-based diff handling for a histogram algorithm with hunk-based processing, introduces a lifetime-parameterized DiffBuilder with InternedInput, and restructures the diff rendering pipeline. Additionally, line-ending normalization is enhanced to detect and report CRLF differences as a user-visible note. Sequence Diagram(Not applicable — the changes are primarily a dependency and internal implementation swap without new external control flow.) Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
libs/resolver/display.rs (1)
67-111: Multiple hunks are merged into one block and line numbers become incorrect
handle_diffappends all hunks’ removed/added lines intoself.orig/self.editand only callsflush_changes()once at the end. That has two concrete effects:
- Separate change regions (distinct hunks from
diff.hunks()) are rendered as a single diff block instead of separate regions.orig_line/edit_linenever get updated per hunk, so all lines in the combined block are numbered as if they started at line 1, which is wrong for files with multiple distant changes.Given imara-diff’s
Hunk.before/Hunk.afterare 0-based index ranges intoinput.before/input.after(one token per line), you can restore correct behavior by flushing per hunk and resetting the starting line numbers from those ranges. For example:- fn handle_diff(mut self, diff: Diff) -> String { - for hunk in diff.hunks() { - if !hunk.before.is_empty() || !hunk.after.is_empty() { - self.has_changes = true; - } - // collect all removed lines - for (i, del) in hunk.before.enumerate() { - let s = self.input.interner[self.input.before[del as usize]]; - if i > 0 { - self.orig.push('\n'); - } - self.orig.push_str(&fmt_rem_text_highlight(s)); - } - // collect all added lines - for (i, ins) in hunk.after.enumerate() { - let s = self.input.interner[self.input.after[ins as usize]]; - if i > 0 { - self.edit.push('\n'); - } - self.edit.push_str(&fmt_add_text_highlight(s)); - } - } - - self.flush_changes(); - self.output - } + fn handle_diff(mut self, diff: Diff) -> String { + let mut first = true; + + for hunk in diff.hunks() { + if hunk.before.is_empty() && hunk.after.is_empty() { + continue; + } + + // flush previous hunk’s block before starting a new one + if !first { + self.flush_changes(); + } else { + first = false; + } + + // reset starting line numbers from absolute token indices + self.orig_line = hunk.before.start as usize + 1; + self.edit_line = hunk.after.start as usize + 1; + self.has_changes = true; + + // collect removed lines for this hunk + for (i, del) in hunk.before.enumerate() { + let s = self.input.interner[self.input.before[del as usize]]; + if i > 0 { + self.orig.push('\n'); + } + self.orig.push_str(&fmt_rem_text_highlight(s)); + } + + // collect added lines for this hunk + for (i, ins) in hunk.after.enumerate() { + let s = self.input.interner[self.input.after[ins as usize]]; + if i > 0 { + self.edit.push('\n'); + } + self.edit.push_str(&fmt_add_text_highlight(s)); + } + } + + self.flush_changes(); + self.output + }This keeps the current no-context behavior but restores one block per hunk with correct absolute line numbers for each change region.
🧹 Nitpick comments (1)
libs/resolver/display.rs (1)
259-315: Add a test covering multiple separated change regionsThe current tests validate single-hunk cases and line-ending normalization, but don’t exercise a file with two or more non-adjacent change regions. That’s exactly where the current
handle_diffimplementation misbehaves (merged hunks and wrong line numbers).Consider adding a test that changes, say, lines 5 and 20 in a 20+ line file and asserts that the diff output shows two separate blocks with the expected line numbers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (3)
Cargo.toml(1 hunks)libs/resolver/Cargo.toml(1 hunks)libs/resolver/display.rs(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: test debug linux-aarch64
- GitHub Check: test debug macos-x86_64
- GitHub Check: test debug windows-x86_64
- GitHub Check: test debug linux-x86_64
- GitHub Check: test release linux-x86_64
- GitHub Check: test debug macos-aarch64
- GitHub Check: lint debug windows-x86_64
- GitHub Check: lint debug macos-x86_64
- GitHub Check: build libs
- GitHub Check: lint debug linux-x86_64
🔇 Additional comments (3)
Cargo.toml (1)
315-315: Addingimara-diffworkspace dependency looks consistentPinned version and placement match the existing dependency style; no issues from the workspace wiring side.
libs/resolver/Cargo.toml (1)
54-54: Resolver’s switch toimara-diffvia workspace dep is coherentThe new
imara-diff.workspace = trueentry correctly matches the implementation indisplay.rsand the workspace dependency added in the rootCargo.toml.libs/resolver/display.rs (1)
9-29: imara-diff integration and CRLF handling look correctUsing
InternedInput::new+Diff::compute(Algorithm::Histogram, &input)followed bypostprocess_lines(&input)matches imara-diff’s recommended text-diff usage, and the newline normalization + “Text differed by line endings.” fast-path preserves and clarifies the old CRLF-only-diff behavior exercised intest_newlines_differing.Also applies to: 31-50
Simplify DiffBuilder to directly write lines per hunk instead of buffering. This fixes: - Missing flush between hunks causing all changes to merge into one block - Incorrect line numbers when unchanged lines exist between hunks - Token trailing newlines being duplicated in output Also fix pre-existing sys_traits dev-dependency missing getrandom feature and remove dead commented-out code. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kajukitli
left a comment
There was a problem hiding this comment.
not approving this as-is
it definitely makes the implementation smaller/faster, but it also regresses the output semantics in a pretty user-visible way.
before, the diff grouped adjacent delete/insert chunks together and kept unchanged context lines in the output shape. with this version you're only iterating diff.hunks() and emitting removed/added lines directly, so the output is much flatter and loses some of the structure that made the formatter diff readable.
you can see it in the adjusted tests:
- the multiline case no longer keeps the old/new
console.log(pairing together in the same way - the single-line case drops the trailing blank added line entirely
for a formatter diff, readability matters more than raw throughput. if the goal is "proof of concept", then yeah the direction seems plausible, but i'd want the old output quality preserved before merging.
Address review feedback: show unchanged context lines between diff hunks so that related changes remain visually connected (e.g. console.log( appearing as both -/+ between surrounding changes). Context lines use non-highlighted colors to distinguish them from actual changes. Only lines immediately preceding a hunk are shown as context to avoid noise. Also update Cargo.lock for the getrandom feature addition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kajukitli
left a comment
There was a problem hiding this comment.
rechecked after the follow-up commits
this is closer, but i still wouldn't merge it yet. the main semantic regression is still there: trailing newline-only insertions/deletions are getting collapsed away because write_*_line() strips the final \n and only emits one logical line per interned line token.
that's exactly why the test had to change from:
2 | +some line text test
3 | +
to just:
2 | +some line text test
for formatter diffs that last blank line is real output and users do care about it. same problem shows up in the multiline test churn too.
so yeah, the perf direction still seems good, but i think the diff renderer needs to preserve newline-only hunks before this is ready.
Interleave deleted and inserted lines within each hunk so that corresponding old/new line pairs appear adjacent in the output. This matches the expected format in spec tests (e.g. gitignore) and preserves readability for formatter diffs. Also fix rustfmt formatting in test code. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove context lines between diff hunks that were causing spec test failures. Remove unused prev_after_end variable and dead code (write_context_line, fmt_rem_text, fmt_add_text). Update test expectations since imara-diff correctly identifies unchanged lines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The imara-diff algorithm matches identical lines (like closing braces) as unchanged context rather than delete/insert pairs, producing more minimal diffs. Update .out files to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kajukitli
left a comment
There was a problem hiding this comment.
looked again and my take hasn't changed
the perf angle is still good, but the renderer is still stripping structural newline information:
let text = text.strip_suffix('\n').unwrap_or(text);both write_rem_line() and write_add_line() do that, so newline-only changes still disappear from the rendered diff. that's why the old test case with the explicit blank added line is still gone.
for formatter diffs, losing a trailing blank line change is not cosmetic — it's a real output regression. i'd still keep this in "good direction, not merge-ready" territory until blank-line/newline-only hunks are preserved.
Add "\ No newline at end of file" marker (matching git convention)
when a line lacks a trailing newline, so that trailing-newline
changes are visible in the rendered diff instead of silently
disappearing after strip_suffix('\n').
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kajukitli
left a comment
There was a problem hiding this comment.
Reviewed the changes. Initial findings didn't hold up on closer inspection. Looks good.
|
The upstream crate so far only supports word-diffing. That's a regression from the current implementation of Deno. You need to do one of the following:
|
Emit a `...` separator line between hunks that have gaps (skipped unchanged lines), making it clear when diff output jumps across non-adjacent lines. Also remove unnecessary getrandom feature from dev-dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dsherret
left a comment
There was a problem hiding this comment.
LGTM! I think just word diffing is probably fine?
Benchmark:
|
| File | Lines | Size | Stable (avg) | Canary (avg) | Speedup |
|---|---|---|---|---|---|
bad_500.ts |
1,152 | 36 KB | 756ms | 198ms | 3.8x |
bad_2000.ts |
4,545 | 144 KB | 8,859ms | 299ms | 29.6x |
bad_5000.ts |
11,345 | 356 KB | (too slow) | ~300ms est. | >>30x |
The stable build became unusably slow on the 5,000+ line files (was still running after minutes), while canary stayed under 350ms. The dissimilar → imara-diff swap is a massive improvement, especially as file size grows — looks like the old algorithm scaled super-linearly.
Benchmark script
Generated badly-formatted TS files with inconsistent spacing, nested objects, switch statements, classes, etc. Then ran:
deno fmt --check bad_${size}.tsTimed with millisecond precision over 10 iterations, reporting min/avg/max.
Versions tested:
- Canary: deno 2.7.5+5b7680f (canary, release, aarch64-apple-darwin)
- Stable: deno 2.7.5 (stable, release, aarch64-apple-darwin)
This swaps out dissimilar for imara which is substantially faster at diffing strings.
Note that this is a proof of concept and I did not have enough time to make the output pretty. I just shows that the diff is fast. Applying colors should be doable without changing much about the perf. I'm willing to fix this up if I get an OK about the general direction.
Fixes #30634